117 research outputs found
Séparateurs à Vaste Marge pondérés en norme l2 pour la sélection de variables en apprentissage d'ordonnancement (l2-norm weighted Support Vector Machines for feature selection in learning to rank)
National audience. Learning to rank algorithms deal with a very large number of features to automatically learn ranking functions, which increases both the computational cost and the number of noisy, redundant features. Feature selection is a promising way to address these issues. In this paper, we propose new feature selection algorithms for learning to rank based on reweighted l2 SVM approaches. We investigate an l2-AROM algorithm to solve the l0-norm optimization problem and a generic l2-reweighted algorithm to approximate l0- and l1-norm SVM problems with l2-norm SVMs. Experiments show that our algorithms are up to 10 times faster and use up to 7 times fewer features than state-of-the-art methods, without lowering ranking performance.
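The reweighted-l2 idea from this abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function names are invented, and a few epochs of hinge-loss subgradient descent stand in for a proper l2 SVM solver. The core mechanism is the one described, though: repeatedly fit an l2 SVM, then rescale each feature by the magnitude of its learned weight, so that irrelevant features are driven toward zero.

```python
import numpy as np

def hinge_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Plain l2-regularised linear SVM via subgradient descent (no bias term).
    Stand-in for a real SVM solver, kept short for illustration."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        mask = margins < 1  # points violating the margin
        grad = lam * w - (y[mask, None] * X[mask]).mean(axis=0) if mask.any() else lam * w
        w -= lr * grad
    return w

def l2_arom(X, y, n_iter=10):
    """Reweighted-l2 feature selection in the spirit of l2-AROM:
    multiply each feature's scale by |w_j| from the previous l2 SVM fit.
    Features that stay uninformative see their scale shrink geometrically."""
    z = np.ones(X.shape[1])
    for _ in range(n_iter):
        w = hinge_svm(X * z, y)
        z = z * np.abs(w)
        z = z / (z.max() + 1e-12)  # renormalise for numerical stability
    return z  # large z_j => feature j is selected
```

On toy data where only a couple of features carry the label, the returned scores concentrate on those features, which is the sparsity effect the abstract exploits.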
Learning to Choose: Automatic Selection of the Information Retrieval Parameters
International audience. In this paper we advocate a selective information retrieval process for the context of repeated queries. The method is based on a training phase in which the meta-search system learns the best parameters to use on a per-query basis. The training phase uses a sample of annotated documents for which document relevance is known. When an identical query is later submitted to the system, it automatically knows which parameters to use to process it. This learning-to-choose method is evaluated on simulated data from TREC campaigns. We show that system performance increases substantially in terms of Mean Average Precision (MAP), specifically for queries that are difficult to answer, compared to any single system configuration applied to all queries.
Outils pour chercher de l'information sur R et se former (Tools for finding information about R and for self-training)
National audience. In this communication, we survey the resources available online for finding information about R, its installation, and its use, as well as those that support self-training. Our aim is not to provide an exhaustive list of these resources but, given the profusion and growth of websites, blogs, and other resources devoted to the software, to give an organized description of those we have used or found valuable.
Évaluation de la pertinence dans les moteurs de recherche géoréférencés (Evaluating relevance in geo-referenced search engines)
National audience. Learning to rank documents in a search engine requires relevance judgments. We present the results of a pilot study on relevance modeling for local, geo-referenced search engines. These engines present search results on a map or as a list of place cards. Each card contains the attributes of a place (name, address, phone number, etc.), most of which are links users can click. We model relevance as the weighted sum of all clicks on a result. We obtain good results by assigning the same weight to each component of the model, and we propose a relative ordering between click types to determine the optimal weights.
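The click model described above is a simple weighted sum, which can be written in a few lines. The click-type names and weight values below are hypothetical placeholders, not taken from the paper; equal weights reflect the abstract's finding that equal weighting already performs well.

```python
# Hypothetical click-type weights (illustrative values, not from the paper).
# Equal weights correspond to the baseline the abstract reports as effective.
CLICK_WEIGHTS = {"name": 1.0, "address": 1.0, "phone": 1.0, "website": 1.0}

def relevance(click_counts, weights=CLICK_WEIGHTS):
    """Relevance of a search result as the weighted sum of clicks on its
    attributes. click_counts maps click type -> number of clicks, e.g.
    {"address": 3, "phone": 1}. Unknown click types contribute nothing."""
    return sum(weights.get(t, 0.0) * n for t, n in click_counts.items())
```

With equal weights, a result clicked 3 times on its address and once on its phone number gets relevance 4.0; learning unequal weights from a click-type ordering would only change the `weights` dictionary.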
Unravelling 'omics' data with the R package mixOmics
Performance Analysis of Information Retrieval Systems
International audience. It has been shown that no single information retrieval system configuration works best for every query; rather, performance can vary from one query to another. It would therefore be useful if a meta-system could decide which system should process a new query by learning from the context of previously submitted queries. This paper reports a deep analysis of more than 80,000 search engine configurations applied to 100 queries and the corresponding performance. The goal of the analysis is to identify which search engine configuration responds best to a certain type of query. We considered two approaches to defining query types: one clusters queries according to their performance (their difficulty), while the other clusters queries using various query features (including query difficulty predictors). We identified two parameters that should be optimized first. An important outcome is that we could not obtain strongly conclusive results; given the large number of systems and methods we used, this suggests that current query features do not fit this optimization problem.
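The performance-based clustering approach can be sketched as follows. This is an illustrative reconstruction, not the paper's pipeline: `kmeans`, the performance matrix `perf`, and the cluster-to-configuration mapping are hypothetical stand-ins (the study used far richer query features and over 80,000 configurations).

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means for illustration (not the paper's exact clustering)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute centers
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def best_config_per_cluster(perf, labels):
    """perf[q, c] = effectiveness (e.g. MAP) of configuration c on query q.
    For each query cluster, pick the configuration with the best mean score."""
    return {j: int(perf[labels == j].mean(0).argmax())
            for j in np.unique(labels)}
```

At query time, a meta-system would assign a new query to its nearest cluster and run the configuration selected for that cluster; the paper's negative result concerns whether query features predict these clusters well enough for this to pay off.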
Quinze ans de recherche appliquée en science des données (Fifteen years of applied research in data science)
This memoir synthesizes fifteen years of scientific activity at the Institut de Mathématiques de Toulouse. It describes my role in interdisciplinary research work centered on data analysis. In this context, beyond the application of statistical methods, I developed an entire methodology for making the best use of data and adding value to it. Thus, after offering some reflections on the notion of data in a first chapter, I devote the second chapter to the construction of a working methodology for interdisciplinary collaborations. I illustrate its construction and application through several case studies, notably the analysis of data from high-throughput biotechnologies. This methodology extends from the formulation of a precise question to the interpretation of the results of a statistical method that can potentially answer it. It fits naturally into what has become commonly known as data science. The third chapter focuses on my favorite topic: data integration. This research theme aims to develop approaches or methods for extracting more relevant information by analyzing several data sets jointly rather than separately. It is illustrated first in the context of information retrieval and then in the analysis of biological data. In the latter case, I contributed to the development of new statistical methods and to their dissemination within the biology community. To this end, I regularly supervised the application of these new methods in research projects, supervised PhD and master's students, and helped make software tools available, for which I also provided training.
Finally, the fourth chapter is devoted to my research support activities.
Visual clustering for data analysis and graphical user interfaces
International audience. Cluster analysis is a major data mining method for presenting overviews of large data sets. Clustering methods allow dimension reduction by finding groups of similar objects or elements. Visual cluster analysis has been defined as a specialization of cluster analysis and is considered a solution for handling complex data through interactive exploration of clustering results. In this chapter, we consider three case studies to illustrate cluster analysis and interactive visual analysis. The first case study relates to information retrieval and illustrates multi-dimensional data, where the objects to analyze are represented by various features or variables. Evaluation in information retrieval involves many performance measures; cluster analysis is used to reduce them to a small number that can be used to compare search engines. The second case study considers networks, where the data to analyze take the form of adjacency matrices. The data come from publications, and cluster analysis is used to study collaboration networks. The third case study concerns curve clustering and applies to temporal data; here the application is time-series gene expression. We conclude the chapter by presenting other types of data for which visual clustering can serve analysis purposes, and some tools that implement visual analysis functionalities not covered in the case studies.
Contribution of an integrative study to the understanding of plant adaptation to their environment: A focus on plant cell walls.
National audience
- âŠ